Clitics in Arabic Language: A Statistical Study
Abstract
Clitics in Arabic can be attached to a stem or to each other without an orthographic mark such as an apostrophe. In this paper we present a statistical study of clitics and their effect on the Arabic lexicon. We tokenize a large Arabic text in two ways, by white space and with an automatic clitic tokenizer (AMIRA 2.0), and compare the unique-word count in both cases with that of English. We also show the resulting distribution of clitics in Arabic and examine the performance of the tokenizer used. Using a 600-million-word Arabic corpus, we report that the corresponding lexicon size could be reduced by 24.54% when clitic tokenization is applied.
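The comparison described in the abstract (unique-word count after white-space tokenization versus after clitic tokenization) can be illustrated with a minimal sketch. This is not AMIRA 2.0 and not the authors' method: the proclitic list, the minimum stem length, and the function names `split_proclitics`, `lexicon_sizes`, and `reduction_percent` are all assumptions made for illustration. The sample words are written in Buckwalter transliteration (`w` = wa-, `b` = bi-, `l` = li-) to keep the example ASCII-only.

```python
# Toy illustration only: this is NOT AMIRA 2.0. The proclitic list and the
# minimum-stem-length rule are simplifying assumptions, and the input is
# assumed to be Buckwalter-transliterated Arabic.
PROCLITICS = ("w", "b", "l")  # wa- (and), bi- (with/by), li- (for/to)
MIN_STEM_LEN = 3              # stop peeling once only a 3-letter stem is left


def split_proclitics(word):
    """Greedily peel single-letter proclitics, marking each with a trailing '+'."""
    pieces = []
    while len(word) > MIN_STEM_LEN and word.startswith(PROCLITICS):
        pieces.append(word[0] + "+")
        word = word[1:]
    pieces.append(word)
    return pieces


def lexicon_sizes(text):
    """Return (unique white-space tokens, unique tokens after clitic splitting)."""
    words = text.split()
    whitespace_lexicon = set(words)
    clitic_lexicon = {piece for w in words for piece in split_proclitics(w)}
    return len(whitespace_lexicon), len(clitic_lexicon)


def reduction_percent(before, after):
    """Relative lexicon-size reduction, in percent."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    # ktAb = "book", qlm = "pen", each also appearing with attached proclitics.
    sample = "ktAb wktAb bktAb qlm wqlm lqlm"
    before, after = lexicon_sizes(sample)
    print(before, after, round(reduction_percent(before, after), 2))
```

A rule this naive over-segments any word whose stem happens to begin with one of the listed letters, which is one reason a trained tokenizer is used in practice and why its performance needs to be examined separately.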
Similar resources
Constructing and Using Broad-coverage Lexical Resource for Enhancing Morphological Analysis of Arabic
Broad-coverage language resources which provide prior linguistic knowledge must improve the accuracy and the performance of NLP applications. We are constructing a broad-coverage lexical resource to improve the accuracy of morphological analyzers and part-of-speech taggers of Arabic text. Over the past 1200 years, many different kinds of Arabic language lexicons were constructed; these lexicons...
Dialectal to Standard Arabic Paraphrasing to Improve Arabic-English Statistical Machine Translation
This paper is about improving the quality of Arabic-English statistical machine translation (SMT) on dialectal Arabic text using morphological knowledge. We present a light-weight rule-based approach to producing Modern Standard Arabic (MSA) paraphrases of dialectal Arabic out-of-vocabulary (OOV) words and low frequency words. Our approach extends an existing MSA analyzer with a small number of...
Abstracts/Journal of the Arabic Language and Literature, Vol. 14, No. 48, Autumn 2018
Contents: The Representation of Culture in Arabic pedagogy books to non-Arabic languages (Danesh Mohammadi, Sakineh Zarenejad); Critical Study of the manifestations of Mamluke's life from the novel “Alsaerounniyam...
Formulation of Language Teachers' Identity in the Situated Learning of Language Teaching Community of Practice
A community of practice may shape and reshape the identity of members of the community through providing them with situated learning or learning environment. This study, therefore, is to clarify the salient learning-based features of the language teaching community of practice that might formulate the identity of language teachers. To this end, the study examined how learning situations in two ...
Automatic Headline Generation using Character Cross-Correlation
Arabic is a morphologically complex language. Affixes and clitics are regularly attached to stems, which makes direct comparison between words impractical. In this paper we propose a new automatic headline generation technique that uses character cross-correlation to extract the best headlines and to overcome the complex morphology of Arabic. The system that uses character cross-...